System and method for image matting

ABSTRACT

A method and system extracts a matte from images acquired of a scene. A foreground image focused at a foreground in a scene, a background image focused at a background in the scene, and a pinhole image focused on the entire scene are acquired. These three images can be acquired sequentially by a single camera, or simultaneously by three cameras. In the latter case, foreground, background, and pinhole sequences of images can be acquired. The pinhole image is compared to the foreground image and the background image to extract a matte representing the scene. The comparison classifies pixels in the images as foreground, background, or unknown pixels. An optimizer minimizes an error function in the form of Fourier image equations using a gradient descent method. The error function expresses pixel intensity differences.

FIELD OF THE INVENTION

This invention relates generally to image editing, and more particularly to matting.

BACKGROUND OF THE INVENTION

Matting and compositing are frequently used in image editing, 3D photography, and film production. Matting separates a foreground region from an input image by estimating a color F and an opacity α for each pixel in the image. Compositing blends the extracted foreground into an output image, using the matte, to represent a novel scene.

The opacity measures a ‘coverage’ of the foreground region, due to either partial spatial coverage or partial temporal coverage, i.e., motion blur. The set of all opacity values is called the alpha matte, the alpha channel, or simply the matte.

Matting is described generally by Smith et al., “Blue screen matting,” Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, ACM Press, pp. 259-268, and U.S. Pat. No. 4,100,569, “Comprehensive electronic compositing system,” issued to Vlahos on Jul. 11, 1978.

Conventional matting requires a background with known, constant color, which is referred to as blue screen matting. If a digital camera is used, then a green matte is preferred.

Blue screen matting is the predominant technique in the film and broadcast industry. For example, broadcast studios use blue matting for presenting weather reports. The background is a blue screen, and the foreground region includes the weatherman standing in front of the blue screen. The foreground is extracted, and then superimposed onto a weather map so that it appears that the weatherman is actually standing in front of the map.

However, blue screen matting is costly and not readily available to casual users. Even production studios would prefer a lower-cost and less intrusive alternative.

Rotoscoping permits non-intrusive matting, Fleischer 1917, “Method of producing moving picture cartoons,” U.S. Pat. No. 1,242,674. Rotoscoping involves the manual drawing of a matte boundary on individual frames of a movie.

Ideally, one would like to extract a high-quality matte from an image or video with an arbitrary, i.e., unknown, background. This process is known as natural image matting.

Recently, there has been substantial progress in this area, Ruzon et al., “Alpha estimation in natural images,” CVPR, vol. 1, pp. 18-25, 2000, Hillman et al., “Alpha channel estimation in high resolution images and image sequences,” Proceedings of IEEE CVPR 2001, IEEE Computer Society, vol. 1, pp. 1063-1068, 2001, Chuang et al., “A bayesian approach to digital matting,” Proceedings of IEEE CVPR 2001, IEEE Computer Society, vol. 2, pp. 264-271, 2001, Chuang et al., “Video matting of complex scenes,” ACM Trans. on Graphics 21, 3, pp. 243-248, July 2002, and Sun et al., “Poisson matting,” ACM Trans. on Graphics, August 2004.

Unfortunately, all of those methods require substantial manual intervention, which becomes prohibitive for long image sequences and for non-professional users.

The difficulty arises because matting from a single image is fundamentally under-constrained. The matting problem considers the input image as a composite of a foreground layer F and a background layer B, combined using linear blending of radiance values for a pinhole camera:

I_(P)[x,y]=αF+(1−α)B,   (1)

where αF is the pre-multiplied image of the foreground regions against a black background, and B is the image of the opaque background in the absence of the foreground.

Matting is the inverse problem of solving for the unknown values of the variables (α, F_(r), F_(g), F_(b), B_(r), B_(g), B_(b)) given the composite image pixel values (I_(Pr), I_(Pg), I_(Pb)). The ‘P’ subscript denotes that Equation (1) holds only for a pinhole camera, i.e., where the entire scene is in focus. One can approximate a pinhole camera with a very small aperture. Blue screen matting is easier to solve because the background color B is known.
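To make the imbalance concrete, the following minimal sketch (illustrative only; the array names are hypothetical and not part of the invention) renders Equation (1) with NumPy and tallies the per-pixel count of unknowns versus constraints:

    import numpy as np

    def pinhole_composite(alpha, F, B):
        """Equation (1): I_P = alpha*F + (1 - alpha)*B.
        alpha: HxW matte in [0, 1]; F, B: HxWx3 radiance images."""
        a = alpha[..., None]
        return a * F + (1.0 - a) * B

    # Per pixel there are 7 unknowns (alpha, F_rgb, B_rgb) but only 3
    # constraints (I_P_rgb), so a single image under-constrains matting.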

It is desired to perform matting using non-intrusive techniques. That is, the scene does not need to be modified. It is also desired to perform the matting automatically. Furthermore, it is desired to provide matting for ‘rich’ natural images, i.e., images with a lot of fine, detailed structure, such as outdoor scenes.

Most natural image matting methods require manually defined trimaps to determine the distribution of color in the foreground and background regions. A trimap segments an image into background, foreground, and unknown pixels. Using the trimaps, those methods estimate likely values of the foreground and background colors of unknown pixels, and use the colors to solve the matting Equation (1).

Bayesian matting, and its extension to image sequences, produce the best results in many applications. However, those methods require manually defined trimaps for key frames. This is tedious for long image sequences.

It is desired to provide a method that does not require user intervention, and that can operate in real-time as an image sequence is acquired.

The prior art estimation of the color distributions works only when the foreground and background are sufficiently different in a neighborhood of an unknown pixel.

It is desired to provide a method that can extract a matte where the foreground and background pixels have substantially similar color distributions.

The Poisson matting of Sun et al. 2004 solves a Poisson equation for the matte by assuming that the foreground and background are slowly varying. Their method interacts closely with the user by beginning from a manually constructed trimap. They also provide ‘painting’ tools to correct errors in the matte.

A method that acquires pixel-aligned images has been successfully used in other computer graphics and computer vision applications, such as high-dynamic range (HDR) imaging, Debevec and Malik, “Recovering high dynamic range radiance maps from photographs,” Proceedings of the 24th annual conference on Computer graphics and interactive techniques, ACM Press/Addison-Wesley Publishing Co., pp. 369-378, and Nayar and Branzoi, “Adaptive dynamic range imaging: Optical control of pixel exposures over space and time,” Proceedings of the International Conference on Computer Vision (ICCV), 2003.

Another system illuminates a scene with visible light and infrared light. Images of the scene are acquired via a beam splitter. The beam splitter directs the visible light to a visible light camera and the infrared light to an infrared camera. That system extracts high-quality mattes from an environment with controlled illumination, Debevec et al., “A lighting reproduction approach to live action compositing,” ACM Trans. on Graphics 21, 3, pp. 547-556, July 2002. Similar systems have been used in film production. However, flooding the background with artificial light is impossible for large natural outdoor scenes illuminated by ambient light.

An unassisted, natural video matting system is described by Zitnick et al., “High-quality video view interpolation using a layered representation,” ACM Trans. on Graphics 23, 3, pp. 600-608, 2004. They acquire videos with a horizontal row of eight cameras spaced over about two meters. They measure depth discrepancies from stereo disparity using sophisticated region processing, and then construct a trimap from the depth discrepancies. The actual matting is determined by the Bayesian matting of Chuang et al. However, that method has the view-dependent problems that are unavoidable with stereo cameras, e.g., reflections, specular highlights, and occlusions. It is desired to avoid view-dependent problems.

SUMMARY OF THE INVENTION

Matting is a process for extracting a high-quality alpha matte and foreground from an image or a video sequence.

Conventional techniques require either a known background, e.g., a blue screen, or extensive manual interaction, e.g., manually specified foreground and background regions.

Matting is generally under-constrained, because not enough information is obtained when the images are acquired.

The invention provides a system and method for extracting a matte automatically from images of rich, natural scenes illuminated only by ambient light.

The invention uses multiple synchronized cameras that are aligned on a single optical axis with a single center of projection. Each camera has the identical view of the scene, but a different depth of field. Alternatively, a single camera can be used to acquire images sequentially at different depths of field.

A first image or video has the camera focused on the background, a second image or video has the camera focused on the foreground, and a third image or video is acquired by a pinhole camera so that the entire scene is in focus.

The images are analyzed according to Fourier image formation equations, which are over-constrained and share a single point of view but differ in their plane of focus. We minimize an error in the Fourier image equations.

The invention solves the fully dynamic matting problem without manual intervention. Both the foreground and background can have high frequency components and dynamic content. The foreground can resemble the background. The scene can be illuminated only by ambient light.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a system for extracting a matte from images according to the invention;

FIG. 2 is a flow diagram of a method for extracting a matte from images according to the invention; and

FIG. 3 is a schematic of an optical geometry with different depths of field according to the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

System Overview

FIGS. 1 and 2 show a system 100 and method 200 according to our invention for automatically extracting a matte 141 from images acquired of a scene 110 including a background region (B) 111 having a background depth of field 131, and a foreground region (F) 112 having a foreground depth of field 132. The scene can be a natural, real world indoor or outdoor scene illuminated only by ambient light.

Cameras

The images are acquired 210 by a background camera 101, a foreground camera 102, and a pinhole camera (P) 103. The three cameras 101-103 are aligned on a single optical axis 160, sharing a single virtual center of projection, using first and second beam splitters 151-152. Therefore, all cameras have an identical point of view of the scene 110. The cameras are synchronized and connected to a processor 140.

The foreground and background cameras have relatively large apertures, resulting in small, non-overlapping depths of field 131 and 132. That is, the depths of field are substantially disjoint. The pinhole camera has a very small aperture resulting in a large depth of field 133 with the entire scene in focus.

The foreground camera produces sharp images for the foreground region within about ½ meter of the depth z_(F) of a foreground image plane 162 and defocuses regions farther away. The background camera produces sharp images for the background region with a background plane 161 at a depth z_(B) from about four meters to infinity and defocuses the foreground region, see FIG. 2. The pinhole camera is nominally focused on the foreground region. It should be noted that other depth of field settings can be used for the foreground and background cameras, depending on the structure of the scene.

Alternatively, a single camera can be used to acquire three images sequentially with the different aperture settings. This works for relatively static scenes, or for slowly varying scenes if the frame rate is relatively high or the exposure time is relatively short. In this case, the single camera acts in turn as the foreground, background, and pinhole camera as the camera settings are changed.

Our cameras respond linearly to incident radiance. We connect each camera to the processor 140 with a separate FireWire bus 142. The cameras acquire images at 30 frames per second. We equip each camera with a 50 mm lens 104. The pinhole camera is positioned after the first beam splitter 151. The aperture of the pinhole camera is f/12.

The pinhole camera 103 is focused on the foreground plane 162, because acquiring a correct matte is more important than correctly reconstructing the background. The foreground and background cameras have f/1.6 apertures and are positioned after the second beam splitter 152. Although each camera receives only half the light of the pinhole camera, the relatively large apertures acquire a relatively large amount of illumination. Therefore, the exposure for these two cameras 101-102 is shorter than the exposure for the pinhole camera. As long as the acquired images are not under-exposed or over-exposed, the color calibration process corrects remaining intensity differences between cameras.

Calibration

The cameras are calibrated to within a few pixels. Calibration is maintained by software. The optical axes are aligned to eliminate parallax between cameras. Because the focus is different for the different cameras, the acquired images are of different sizes. We correct for this with an affine transformation. We color correct the images by solving a similar problem in color space. Here, the feature points are the colors of an image of a color chart and the affine transformation is a color matrix. We apply color and position correction in real-time to all image sequences.
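For illustration only, a least-squares fit of such an affine color matrix from chart correspondences might look like the following sketch; the function names and the homogeneous 4×3 parameterization are assumptions, not the exact calibration of the system:

    import numpy as np

    def fit_color_affine(src, dst):
        """Fit dst ~ [r, g, b, 1] @ A from N corresponding chart colors.
        src, dst: Nx3 arrays of measured and reference colors."""
        src_h = np.hstack([src, np.ones((src.shape[0], 1))])  # Nx4
        A, *_ = np.linalg.lstsq(src_h, dst, rcond=None)       # 4x3 matrix
        return A

    def apply_color_affine(img, A):
        """Apply the fitted affine color correction to an HxWx3 image."""
        h, w, _ = img.shape
        px = np.hstack([img.reshape(-1, 3), np.ones((h * w, 1))])
        return (px @ A).reshape(h, w, 3)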

Image Sequences

For videos, each camera produces a 640×480×30 fps encoded image sequence. The sequences of images are processed by the processor 140 performing a matte extraction method 200 according to our invention.

Method Overview

FIG. 2 shows a method 200 for automatically extracting a matte according to the invention. Background, foreground, and pinhole sequences of images (videos) 201, 202, 203, respectively, are acquired 210 of the scene 110 by the cameras 101-103. It should be understood that a single camera can be used as well, acquiring images sequentially at the appropriate different depths of field.

The pixels in each pinhole image are classified as either background, foreground, or unknown by matching neighborhoods around the pixel with corresponding neighborhoods of pixels in the background and foreground images. The classification constructs 220 a trimap 221 for each pinhole image. An optimization process 230 is applied to the unknown pixels. The optimizer minimizes an error in classifying the unknown pixels as either background or foreground pixels. This produces the matte 141.

Scene Model

We model the scene 110 as a textured foreground plane 162 with partial coverage, and an opaque textured background plane 161. Because the background depth of field is larger than the foreground depth of field, and because there is no parallax between our cameras, the background region with varying depths can still be approximated as a plane for the purpose of matting.

We pose matting as an over-constrained optimization problem. For each pixel, there are the seven unknown “scene” values, α, F_({r,g,b}), and B_({r,g,b}), and nine constraint values I_(P{r,g,b}), I_(F{r,g,b}), and I_(B{r,g,b}) from the images I acquired by the cameras. The ‘P’ subscript denotes the pinhole images, the ‘F’ subscript the foreground-focused images, and the ‘B’ subscript the background-focused images.

Optimizer

We solve Fourier image formation equations by minimizing an error in classifying unknown pixels using the optimizer. To accelerate convergence of our optimizer, we construct 220 the trimaps 221 automatically using depth-from-defocus information, and select initial values that are likely near a true solution for the unknowns of the equations.

Initial foreground values F₀ for the optimizer are determined by automatically assigning known foreground colors to unknown regions. Initial background values B₀ are determined by reconstructing occluded areas from neighboring images, and then ‘painting’ into always occluded regions. Initial alpha coverage values α₀ are determined by solving a pinhole compositing equation using F₀ and B₀.
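One plausible reading of the last step, shown as a hedged sketch below, solves Equation (1) for α per pixel in the least-squares sense over the color channels; the clamping to [0, 1] is an added assumption, not a step stated above:

    import numpy as np

    def initial_alpha(I_P, F0, B0, eps=1e-6):
        """Solve Equation (1) for alpha per pixel in least squares over
        RGB: alpha = ((I_P - B0) . (F0 - B0)) / |F0 - B0|^2."""
        d = F0 - B0
        num = np.sum((I_P - B0) * d, axis=-1)
        den = np.sum(d * d, axis=-1) + eps   # guard F0 ~ B0 pixels
        return np.clip(num / den, 0.0, 1.0)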

Defocus matting is poorly conditioned when the foreground and background have the same color, when the scene lacks high frequency components, or when the images are under-exposed or over-exposed. To avoid local minima and to stabilize the optimizer in these poorly conditioned areas, we add regularization terms to our optimizer.

The core of our optimizer 230 is the error function, which is invoked a few hundred times per image. Therefore, the challenge in solving the defocus matting by optimization is selecting an error function that is efficient to evaluate and easy to differentiate. Our error function is a sum-squared pixel value error between the acquired images and composite images rendered from the unknowns.

Evaluating and differentiating the error function naively makes the problem intractable. To move towards a global minimum, the optimizer must find the gradient of the error function, i.e., the partial derivatives with respect to each unknown variable.

For a 320×240 pixel color image sequence at 30 fps, we need to solve for over 13 million unknowns per second. For instance, numerically evaluating the gradient invokes the error function once for each variable. For our method, this involves rendering three full-resolution images. A very fast ray tracer may be able to render the images in three seconds. That means a single call to the error function also takes three seconds. Therefore, it would take years to optimize a few seconds of video using conventional techniques.

Therefore, we approach the minimization as a graphics-specific problem. We symbolically manipulate expressions to avoid numerical computations. Thus, we provide a very fast approximation to the image synthesis problem, which enables us to evaluate the error function in milliseconds. We replace numerical evaluation of the error derivative with a symbolic derivative based on our synthesis equations, described below.

Notation

We use the following notation to compactly express discrete imaging operations. Monochrome images are 2D matrices that have matching dimensions. Image matrices are multiplied component-wise, without a matrix multiplication. A multi-parameter image is sampled across camera parameters, such as wavelength λ, focus, and time t, as well as pixel location.

We represent the multi-parameter image with a 3D or larger matrix, e.g., C[x, y, λ, z, t]. This notation and our matting method extend to images with more than three color samples and to other parameters, such as polarization, sub-pixel position, and exposure. Expressions, such as C[λ, z], where some parameters are missing, denote a sub-matrix containing elements corresponding to all possible values of the unspecified parameters, i.e., x, y, and t.

Generally, our equations have the same form in the x and y dimensions, so we frequently omit the parameter y. We also omit the z, λ, and t parameters when these parameters do not vary for a particular equation.

A convolution F⊗G of an image F and a matrix G has the same size as F. The convolution can be determined by extending the edge values of F by half the size of G, so that the convolution is well defined near the edges of F.

A disk(r)[x, y] is 1/πr² times the partial coverage of the pixel [x, y] by a disk of radius r centered on pixel [0, 0]. If the radius r<½, then the disk becomes a discrete impulse δ[x, y] that is one at [0, 0], and zero elsewhere.

Convolution with an impulse is the identity operation, and convolution with a disk is a ‘blur’ of the input image.
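A minimal sketch of disk(r) and the edge-extended convolution follows; approximating the partial pixel coverage by center sampling is an assumption of the sketch, not the exact kernel:

    import numpy as np
    from scipy.ndimage import convolve

    def disk(r):
        """disk(r)[x, y]: unit-integral disk PSF of pixel radius r.
        Degenerates to a discrete impulse when r < 1/2. Partial pixel
        coverage is approximated here by sampling pixel centers."""
        if r < 0.5:
            return np.ones((1, 1))
        n = int(np.ceil(r))
        y, x = np.mgrid[-n:n + 1, -n:n + 1]
        k = (x * x + y * y <= r * r).astype(float)
        return k / k.sum()

    def blur(img, r):
        """Convolution with edge values extended, as described above."""
        return convolve(img, disk(r), mode='nearest')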

A vector arrow (→) above a variable denotes a multi-parameter image ‘unraveled’ into a column vector along its dimensions in order, e.g., F⃗[x+W((y−1)+H(λ−1))]=F[x, y, λ], for an image with W×H pixels and 1-based indexing. This is equivalent to a raster scan order.

To distinguish the multi-parameter image vectors from image matrices, elements of the unraveled vectors are referenced by subscripts. Linear algebra operators, such as matrix-vector multiplication, inverse, and transpose, operate normally on these vectors.

Defocus Composites

Equation 1 is the discrete compositing equation for a pinhole camera. We derive an approximate compositing equation for a camera with a non-zero aperture, which differs from a pinhole because some locations appear defocused. In computer graphics, cameras are traditionally simulated with distributed ray tracing.

Instead, we use Fourier optics, which are well suited to our image-based matting problem. Defocus occurs because the cone of rays from a point in the scene intersects the image plane at a disk called the point spread function (PSF).

FIG. 3 shows the optical geometry of the situation giving rise to a PSF with pixel radius

$r = \frac{f}{2\sigma\#}\left(\frac{z_{R}(z_{F}-f)}{z_{F}(z_{R}-f)}-1\right),$   (2)

where the camera is focused at depth z_(F), the imaged point is at depth z_(R), # is the f-stop number, f is the focal length, and σ is the width of a pixel.
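As a sketch, Equation (2) translates directly into a function; the absolute value is an added assumption so that the radius is non-negative on either side of the focus plane:

    def psf_radius(z_R, z_F, f, f_number, sigma):
        """Equation (2): PSF radius in pixels for a point at depth z_R,
        with a lens of focal length f and f-stop number f_number focused
        at depth z_F; sigma is the width of a pixel."""
        return abs(f / (2.0 * sigma * f_number)
                   * (z_R * (z_F - f) / (z_F * (z_R - f)) - 1.0))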

Depths z 300 are positive distances in front of the lens 104. A single plane of points perpendicular to the lens axis with pinhole image αF has a defocused lens image given by the convolution (αF)⊗disk(r). Adding the background to the scene complicates matters because the background is partly occluded near foreground object borders.

Consider a bundle of rays emanating from a partly occluded background to the lens. The light transport along each ray is modulated by the α value where the ray intersects the foreground plane. Instead of a cone of light reaching the lens from each background point, a cone cut by the image αF reaches the aperture. Therefore, the PSF varies for each point on the background. The PSF is zero for occluded points, a disk for unoccluded points, and a small cut-out of the α image for partly occluded points. We express the PSF values for the following cases.

Pinhole

When f is very small, or # is very large, r is less than half a pixel at both planes and Equation 1 holds.

Focused on Background

When the background 161 is in focus, the PSF is an impulse, i.e., a zero radius disk with finite integral. Rays in a cone from the background B are still modulated by a disk of (1−α) at the foreground plane, but that disk projects to a single pixel in the final image. Only the average value, and not the shape, of the α disk intersected affects the final image. The composition equation is:

I_(B)=(αF)⊗disk(r_(F))+(1−α⊗disk(r_(F)))B.   (3)

Focused on Foreground

When the background is defocused and only the foreground is in focus, the PSF varies along the border of the foreground. Here, the correct image expression is complicated and slow to evaluate; therefore, we use the following approximation:

I_(F)≈αF+(1−α)(B⊗disk(r_(B))),   (4)

which blurs the background slightly at foreground borders.

A matte is a 2D matrix α[x, y], and the foreground and background images are respectively 3D matrices F[x, y, λ] and B[x, y, λ]. We generalize the two-plane compositing expression with a function of the scene that varies over two discrete spatial parameters, a discrete wavelength (color channel) parameter λ, and a discrete focus parameter zε{1, 2, 3}:

C(α,F,B)[x,y,λ,z]=(αF[λ])⊗h[z]+(1−α⊗h[z])(B[λ]⊗g[z])|_([x,y]),   (5)

where 3D matrices h and g encode the PSFs:

$h[x,y,z]=\begin{cases}\delta[x,y], & z=1\\ \mathrm{disk}(r_{F})[x,y], & z=2\\ \delta[x,y], & z=3\end{cases}\qquad g[x,y,z]=\begin{cases}\delta[x,y], & z=1\\ \delta[x,y], & z=2\\ \mathrm{disk}(r_{B})[x,y], & z=3\end{cases}$   (6,7)

Constants r_(F) and r_(B) are the PSF radii for the foreground and background planes when the camera is focused on the opposite plane.
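Reusing the disk/blur helpers sketched earlier, Equations (3)-(5) can be simulated per color channel as follows; this is an illustrative monochrome sketch, not the exact renderer of the optimizer:

    def defocus_composite(alpha, F, B, r_F, r_B):
        """Equation (5) for z = 1..3 with the PSFs of Equations (6, 7).
        Returns the simulated pinhole, background-focused, and
        foreground-focused images from the scene unknowns (monochrome;
        apply per color channel for RGB)."""
        aF = alpha * F
        I_P = aF + (1.0 - alpha) * B                            # z = 1, Eq (1)
        I_Bf = blur(aF, r_F) + (1.0 - blur(alpha, r_F)) * B     # z = 2, Eq (3)
        I_Ff = aF + (1.0 - alpha) * blur(B, r_B)                # z = 3, Eq (4)
        return I_P, I_Bf, I_Ff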

Trimap From Defocus

The trimap 221 segments the pinhole image into three mutually exclusive and collectively exhaustive regions expressed as sets of pixels. These sets of pixels limit the number of unknown pixels and provide initial estimates for our optimizer 230. In contrast with the prior art, we construct 220 the trimaps 221 automatically as follows.

Areas in the scene that have high-frequency texture produce high-frequency image content in the pinhole image I_(P), and either in the foreground image I_(F) or the background image I_(B), but not both. We use this observation to classify pixels with high-frequency neighborhoods into the three regions, based on the z values at which they appear ‘sharp’.

Sets Ω_(B) and Ω_(F) contain pixels that are respectively “definitely background” (α=0) and “definitely foreground” (α=1). Set Ω contains “unknown” pixels that may be either foreground, background, or some blend of foreground and background. This is the set over which we solve for extracting the matte using our optimizer.

Many surfaces with uniform macro appearance actually have fine structural elements, like the pores and hair on human skin, the grain of wood, and the rough surface of brick. This allows us to detect defocus for many foreground objects even in the absence of strong macro texture. We use lower thresholds to detect high frequency components in the background, where only macro texture is visible.

We determine a first classification of the foreground and background regions by measuring the relative strength of the spatial gradients:

Let D=disk(max(r_(F), r_(B)))
Ω_(F1)=erode(close((|∇I_(F)|>|∇I_(B)|)⊗D>0.6, D), D)
Ω_(B1)=erode(close((|∇I_(F)|<|∇I_(B)|)⊗D>0.4, D), D)   (8,9)

where erode and close are morphological operators used to improve accuracy. The disk is approximately the size of the PSFs. Then, we classify as unknown the ambiguous locations that are either in both Ω_(F1) and Ω_(B1) or in neither, where a tilde denotes the set complement:

Ω=(Ω̃_(F1)∩Ω̃_(B1))∪(Ω_(F1)∩Ω_(B1)).   (10)

Finally, we enforce the mutual exclusion property:

Ω_(F)=Ω_(F1)∩Ω̃, Ω_(B)=Ω_(B1)∩Ω̃.   (11,12)
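A hedged sketch of Equations (8)-(12) with SciPy morphology follows; the gradient operator and the exact structuring-element details are assumptions of the sketch:

    import numpy as np
    from scipy.ndimage import binary_closing, binary_erosion, convolve

    def trimap_from_defocus(I_F, I_B, r_F, r_B):
        """Equations (8)-(12): classify pixels by which focus image has
        the stronger local gradients, then mark ambiguous pixels unknown."""
        Dk = disk(max(r_F, r_B))          # smoothing kernel ~ PSF size
        Ds = Dk > 0                       # structuring element for close/erode
        gF = np.hypot(*np.gradient(I_F))  # |grad I_F|
        gB = np.hypot(*np.gradient(I_B))  # |grad I_B|
        smooth = lambda m: convolve(m.astype(float), Dk, mode='nearest')
        omega_F1 = binary_erosion(binary_closing(smooth(gF > gB) > 0.6, Ds), Ds)
        omega_B1 = binary_erosion(binary_closing(smooth(gF < gB) > 0.4, Ds), Ds)
        # Eq (10): unknown = in both or in neither
        unknown = (omega_F1 & omega_B1) | (~omega_F1 & ~omega_B1)
        # Eqs (11, 12): enforce mutual exclusion
        return omega_F1 & ~unknown, omega_B1 & ~unknown, unknown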

Minimizing Errors in Classifying Unknown Pixels

We pose matting as an error minimization problem for each image, and solve the problem independently for each image. Assume we know the approximate depths of the foreground and background planes and all camera parameters. These are reasonable assumptions because digital cameras directly measure their parameters. From the lens-to-sensor distance we can derive the depths to the planes, if otherwise unknown.

The foreground and background need not be perfect planes; they just need to lie within the foreground and background depth fields. Because the depth of field is related hyperbolically to depth, the background depth field can stretch to infinity.

Let u=[α⃗^(T) B⃗^(T) F⃗^(T)]^(T) be the column vector describing the entire scene, i.e., the unknown pixels in the matting problem, and C⃗(u) be the unraveled composition function from Equation 5.

The unraveled constraints are I⃗=[I⃗_(P)^(T) I⃗_(B)^(T) I⃗_(F)^(T)]^(T). The solution to the matting problem is a scene u* for which the norm of the error vector E⃗(u)=C⃗(u)−I⃗ is minimized according to:

$Q(u)=\sum_{k}\frac{1}{2}\vec{E}_{k}^{2}(u),\qquad u^{*}=\arg\min_{u}Q(u).$   (13,14)

Note that the scalar-valued function Q is not quadratic, because the function Q contains terms of the form (α[x]F[i])².

Iterative solvers appropriate for minimizing such a large system evaluate a given scene u and select a new scene u+Δu as a function of the error vector E⃗(u) and a Jacobian matrix J(u). The Jacobian matrix contains the partial derivative of each element of the error vector with respect to each element of u:

$J_{k,n}(u)=\frac{\partial\vec{E}_{k}(u)}{\partial u_{n}}.$   (15)

The value k is an index into the unraveled constraints, and the value n is an index into the unraveled unknown array. Henceforth, we write E⃗ rather than E⃗(u), and so on for the other functions of u, to simplify the notation in the presence of subscripts.

A gradient descent solver moves opposite the gradient of Q:

$\Delta u=-\nabla Q=-\nabla\sum_{k}\frac{1}{2}\vec{E}_{k}^{2},\qquad\text{so}\qquad\Delta u_{n}=-\frac{\partial\sum_{k}\frac{1}{2}\vec{E}_{k}^{2}}{\partial u_{n}}=-\sum_{k}\left(\vec{E}_{k}\frac{\partial\vec{E}_{k}}{\partial u_{n}}\right),$   (16,17)

hence

$\Delta u=-\vec{E}^{T}J.$   (18)

The gradient descent solver has a space advantage over other methods, like Gauss-Newton and Levenberg-Marquardt, because the gradient function does not need to determine the pseudo-inverse of the Jacobian matrix J. This is important because the vectors and matrices involved are very large.

Let N be the number of unknown pixels and K be the number of constrained pixels. For 320×240 images, the matrix J has about 6×10⁹ elements.

We now derive a simple expression for the elements of the Jacobian matrix and determine that the matrix is sparse, so determining Δu is feasible when we do not need the non-sparse inverse of the matrix J. By definition, the elements are:

$J_{k,n}=\frac{\partial(\vec{C}_{k}(u)-\vec{I}_{k})}{\partial u_{n}}=\frac{\partial\vec{C}_{k}(u)}{\partial u_{n}}.$   (19)

To evaluate Equation 19, we expand the convolution from Equation 5. We change variables from packed 1D vectors indexed by k to images indexed by [x, z, λ]:

$C[x,z,\lambda]=\sum_{s}\alpha[s]F[s,\lambda]h[x-s,z]+\left(1-\sum_{s}\alpha[s]h[x-s,z]\right)\sum_{s}B[s,\lambda]g[x-s,z].$   (20)

An examination of this expansion shows that the matrix J is both sparse and simple. For example, consider the case where unknown pixel u_(n) corresponds to F[i,λ]. In a full expansion of Equation 20, only one term contains F[i,λ], so the partial derivative contains only one term:

$\frac{\partial C[x,\lambda,z]}{\partial F[i,\lambda]}=\alpha[i]\,h[x-i,z].$   (21)

The expressions for the α and B derivatives are only slightly more complicated, with potentially non-zero elements only at:

$\frac{\partial C[x,\lambda,z]}{\partial\alpha[i]}=h[x-i,z]\left(F[i,\lambda]-\sum_{s}B[s,\lambda]\,g[x-s,z]\right)=h[x-i,z]\left(F[i,\lambda]-(B[\lambda]\otimes g[z])[x]\right)$

$\frac{\partial C[x,\lambda,z]}{\partial B[i,\lambda]}=g[x-i,z]\left(1-\sum_{s}\alpha[s]\,h[x-s,z]\right)=g[x-i,z]\left(1-(\alpha\otimes h[z])[x]\right).$   (22,23)

The summations in the last two cases are just elements of convolution terms that appear in E⃗, so there is no additional cost for computing these values.
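Accordingly, the whole product E⃗^(T)J can be accumulated with a handful of blurs. Below is a monochrome sketch of that idea, reusing the helpers above; because the disk and impulse PSFs are symmetric, the correlations that back-propagate E⃗ reduce to the same blur() convolutions. It is an illustration, not the exact solver:

    def error_gradient(alpha, F, B, I_P, I_Bf, I_Ff, r_F, r_B):
        """Accumulate (E^T J) for alpha, F, and B via Equations (21)-(23),
        using one residual image per focus setting z."""
        C_P, C_B, C_F = defocus_composite(alpha, F, B, r_F, r_B)
        E = [C_P - I_P, C_B - I_Bf, C_F - I_Ff]       # residuals per z
        h = [0.0, r_F, 0.0]                           # h[z] radii, Eq (6)
        g = [0.0, 0.0, r_B]                           # g[z] radii, Eq (7)
        Eh = sum(blur(E[z], h[z]) for z in range(3))  # E correlated with h
        dF = alpha * Eh                                             # Eq (21)
        dA = F * Eh - sum(blur(E[z] * blur(B, g[z]), h[z])
                          for z in range(3))                        # Eq (22)
        dB = sum(blur(E[z] * (1.0 - blur(alpha, h[z])), g[z])
                 for z in range(3))                                 # Eq (23)
        return dA, dF, dB   # gradient step: Delta-u = -(dA, dF, dB), Eq (18)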

Trust Region and Weights

The gradient indicates a direction to change u to reduce the error. We use a so-called dogleg trust region scheme to select the magnitude, see Nocedal and Wright, “Numerical Optimization,” Springer Verlag. The idea is to take the largest step that decreases the error. We begin with a trust region of radius S=1.

Let u′=max(0, min(1, u+(SΔu/|Δu|))). If |E⃗(u′)|<|E⃗(u)|, then we assume we have not overshot the minimum and repeatedly double S until the error increases above the lowest level seen this iteration. If |E⃗(u′)|>|E⃗(u)|, then we assume we have overshot and take the opposite action, repeatedly halving S until we pass the lowest error in this iteration. If S becomes very small, e.g., 10⁻¹⁰, or the error norm decreases by less than 0.1%, then we assume that we are at the local minimum and terminate the optimization process.
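A loose sketch of this step-size rule follows; the error-norm callback and the loop guards are assumptions, and the published rule tracks the lowest error seen per iteration, which this simplification only approximates:

    import numpy as np

    def trust_region_step(u, delta_u, err_norm, S=1.0, S_min=1e-10):
        """Take the largest step along delta_u that decreases |E|,
        doubling S while the error falls and halving it after overshoot.
        err_norm(u) returns |E(u)|; u is clamped to [0, 1]."""
        step = lambda s: np.clip(u + s * delta_u / (np.linalg.norm(delta_u)
                                                    + 1e-30), 0.0, 1.0)
        e0 = err_norm(u)
        if err_norm(step(S)) < e0:
            while S < 1e10 and err_norm(step(2.0 * S)) < err_norm(step(S)):
                S *= 2.0                 # not overshot: try a larger step
        else:
            while S > S_min and err_norm(step(S)) >= e0:
                S *= 0.5                 # overshot: back off
        return step(S), S                # caller tests S for convergence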

Because our initial estimates are frequently good, we weight the first N elements of Δu by a constant β_(α)=3 to influence the optimizer to take larger steps in α. This speeds convergence without shifting the global minimum.

The narrow aperture and long exposure used to acquire the pinhole images produce more noise and motion blur than in the foreground and background images I_(F) and I_(B), so we give the corresponding error terms less weight. This prevents over-fitting the noise. This also reduces the over-representation in E⃗ of in-focus pixels that occurs because images F and B are in focus in two of the constraint images and defocused in one each.

Regularization

In foreground regions that are low frequency or visually similar to the background, there are many values of u that satisfy the constraints. We bias the optimizer towards likely solutions. This is regularization of the optimization problem, which corresponds to having a different prior probability for a maximum likelihood problem. Regularization also helps avoid local minima in the error function and stabilizes the optimizer in regions where the global minimum is in a ‘flat’ region that has many possible solutions.

We extend the error vector E⃗ with p new entries, each entry corresponding to the magnitude of a 7N-component regularization vector. Calling these regularization vectors ε, φ, γ, . . . , the error function Q now has the form:

$Q(u)=\sum_{k}\vec{E}_{k}^{2}=\left[\sum_{k=1}^{9K}\vec{E}_{k}^{2}\right]+\vec{E}_{9K+1}^{2}+\vec{E}_{9K+2}^{2}+\ldots=\sum_{k=1}^{9K}\vec{E}_{k}^{2}+\beta_{1}\frac{9K}{7N}\sum_{n}^{7N}\varepsilon_{n}^{2}+\beta_{2}\frac{9K}{7N}\sum_{n}^{7N}\phi_{n}^{2}+\ldots$   (24)

Let e denote any one of the regularization vectors. Each summation over n appears as a new row in the error vector E⃗ and the matrix J for some k>9K:

$\vec{E}_{k}=\left(\beta\frac{9K}{7N}\sum_{n}e_{n}^{2}\right)^{\frac{1}{2}},\qquad J_{k,n}=\frac{\partial\vec{E}_{k}}{\partial u_{n}}=\frac{\beta}{\vec{E}_{k}}\frac{9K}{7N}\sum_{i}\left[e_{i}\frac{\partial e_{i}}{\partial u_{n}}\right].$   (25,26)

The factor 9K/7N makes the regularization magnitude invariant to the ratio of constraints to unknown pixels, and the scaling factor β allows us to control its significance.

We select regularization vectors that are both easy to differentiate and efficient to evaluate, i.e., the summations over i generally contain only one non-zero term.

Regularization influences the optimizer towards the most likely of the many solutions supported by the image data, but rarely leads to an unsupported solution. We use small weights on the order of β=0.05 for each term to avoid shifting the global minimum.

Coherence

The spatial gradients are small:

$e_{n}=\frac{\partial u_{n}}{\partial x};\qquad(\vec{E}^{T}J)_{n}=-\frac{\partial^{2}u_{n}}{\partial x^{2}}.$   (27)

We apply separate coherence terms to α, F, and B, for each color channel and for directions x and y. The alpha gradient constraints are relaxed at edges in the image. The F gradient constraints are increased by a factor of ten where |∇α| is large. These constraints allow sharp foreground edges and prevent noise in the foreground image F where it is ill-defined.

Discrimination

The value α is distributed mostly at 0 and 1:

$e_{n}=u_{n}-u_{n}^{2};\qquad(\vec{E}^{T}J)_{n}=(u_{n}-u_{n}^{2})(1-2u_{n}),\quad 1\leq n\leq N.$   (28)

Background frequencies should appear in B:

$\text{Let }G=I_{B}-I_{F}\otimes\mathrm{disk}(r_{F});\qquad e_{n}=\frac{\partial u_{n}}{\partial x}-\frac{\partial\vec{G}_{n}}{\partial x};\qquad(\vec{E}^{T}J)_{n}=-\frac{\partial^{2}(u_{n}-\vec{G}_{n})}{\partial x^{2}},\quad 4N+1\leq n\leq 7N.$   (29)

Other Applications

Artificial Depth of Field

We can matte a new foreground onto the reconstructed background, but select the point spread functions and transformations arbitrarily. This enables us to render images with virtual depth of field, and even slight translation and zoom.
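Sketched with the composite helper above, recompositing with virtual PSF radii might look like this; the choice of which virtual plane stays sharp is a parameter of the illustration, not fixed by the invention:

    def virtual_depth_of_field(alpha, F_new, B_rec, r_F_virt, r_B_virt,
                               focus='foreground'):
        """Recomposite a new foreground over the reconstructed background
        with arbitrary virtual PSF radii, per Equations (3) and (4)."""
        _, back_focused, fore_focused = defocus_composite(
            alpha, F_new, B_rec, r_F_virt, r_B_virt)
        return fore_focused if focus == 'foreground' else back_focused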

Image Filtering

Defocus is not the only effect we can apply when recompositing against the original background image. Any filter can be used to process the foreground and background separately, using the matte as a selection region, e.g., hue adjustment, painterly rendering, motion blur, or deblurring.

Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

1. A computerized method for extracting a matte from images acquired of a scene, comprising the steps of: acquiring a foreground image focused at a foreground in a scene; acquiring a background image focused at a background in the scene; acquiring a pinhole image focused on the entire scene; and comparing the pinhole image to the foreground image and the background image to extract a matte representing the scene.
2. The method of claim 1, in which the foreground, background, and pinhole images are acquired sequentially by a single camera.
3. The method of claim 1, in which the foreground image, the background image, and the pinhole image are acquired simultaneously by a foreground camera, a background camera, and a pinhole camera.
4. The method of claim 3, further comprising: aligning the foreground camera, the background camera, and the pinhole camera on a single optical axis sharing a single virtual center of projection.
5. The method of claim 3, in which a sequence of foreground images is acquired by the foreground camera, a sequence of background images is acquired by the background camera, and a sequence of pinhole images is acquired by the pinhole camera, and further comprising: comparing each pinhole image with the corresponding foreground image and the corresponding background image to construct a sequence of mattes representing the scene.
6. The method of claim 1, in which a depth of field of the foreground image corresponds to the foreground in the scene, a depth of field of the background image corresponds to the background in the scene, and a depth of field of the pinhole image corresponds to the entire scene.
7. The method of claim 1, further comprising: classifying a particular pixel in the pinhole image as a foreground pixel when a neighborhood of pixels about the particular pixel matches a corresponding neighborhood of pixels in the foreground image; classifying the particular pixel in the pinhole image as a background pixel when the neighborhood of pixels about the particular pixel matches a corresponding neighborhood of pixels in the background image; and otherwise classifying the particular pixel in the pinhole image as an unknown pixel to construct a trimap.
8. The method of claim 1, further comprising: applying an optimizer including an error function to classify the unknown pixels; and minimizing the error function using a gradient solver.
9. The method of claim 8, in which the error function describes intensity differences between pixels of the pinhole image and pixels of the foreground image and pixels of the background image.

10. The method of claim 8, in which a derivative of the error function approximates a symbolic expression.
11. The method of claim 1, in which the scene is a real world natural scene illuminated only by ambient light.

12. The method of claim 3, in which the foreground camera, the background camera, and the pinhole camera are aligned by using a first beam splitter and a second beam splitter.
13. The method of claim 6, in which the depth of field of the foreground camera is substantially disjoint from the depth of field of the background camera.
14. The method of claim 3, in which apertures of the foreground camera and the background camera are relatively large compared to an aperture of the pinhole camera.
15. The method of claim 8, in which the error function is expressed as Fourier image formation equations.
16. The method of claim 8, further comprising: regularizing the optimizer to avoid local minima in the error function and to stabilize the optimizer when a global minimum of the error function has many possible solutions.
17. The method of claim 7, further comprising: reconstructing the background image from only the background pixels; supplying a new foreground image; and compositing the reconstructed background image and the new foreground image according to the matte.
18. The method of claim 17, further comprising: filtering the reconstructed background image and the new foreground image according to the matte.
19. A system for extracting a matte from images acquired of a scene, comprising: a foreground camera configured to acquire a foreground image focused at a foreground in a scene; a background camera configured to acquire a background image focused at a background in the scene; a pinhole camera configured to acquire a pinhole image focused on the entire scene; and means for comparing the pinhole image to the foreground image and the background image to extract a matte representing the scene.
20. The system of claim 19, further comprising: first and second beam splitters configured to align the foreground camera, the background camera, and the pinhole camera on a single optical axis sharing a single virtual center of projection.